#install.packages("here") # This package helps with reproducbility and file paths
#install.packages("tidyverse") # This package will be used for data wrangling and hosts the majority of functions we will be using todayIntro to R and Tidyverse
Welcome to Introduction to R
In this workshop you will learn basic concepts, skills, and tools for working with ecological data - in the hopes to get more done in less time, and with less pain. The course has been specifically tailored to use common Bush Heritage data, but assumes little to no knowledge of R.
You will briefly learn R syntax and data formats, including the functionality of R projects, followed by data wrangling techniques and principles using the tidyverse R package. You will also learn how to plot data using the ggplot2 package. The course will culminate by developing data summary statistics and a simple linear model. Further model development and skills will be provided in the Advanced course for those interested.
Let’s get started!
Installing R and RStudio
Windows
Download R from the CRAN website.
Run the
.exefile that was just downloadedGo to the RStudio download page
Under All Installers, download the RStudio Installer for Windows.
Double click the file to install it
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
MacOS
Download R from the CRAN website.
Select the
.pkgfile for the latest R versionDouble click on the downloaded file to install R
It is also a good idea to install XQuartz (needed by some packages)
Go to the RStudio download page
Under All Installers, download the RStudio Installer for MacOS.
Double click the file to install RStudio
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
Linux
Follow the instructions for your distribution from CRAN, they provide information to get the most recent version of R for common distributions. For most distributions, you could use your package manager (e.g., for Debian/Ubuntu run
sudo apt-get install r-base, and for Fedorasudo yum install R), but we don’t recommend this approach as the versions provided by this are usually out of date. In any case, make sure you have at least R 3.3.1.Go to the RStudio download page
Under All Installers, select the version that matches your distribution and install it with your preferred method (e.g., with Debian/Ubuntu
sudo dpkg -i rstudio-YYYY.MM.X-ZZZ-amd64.debat the terminal).Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
Updating R and RStudio
If you already have R and RStudio installed, first check if your R version is up to date:
When you open RStudio your R version will be printed in the console on the bottom left. Alternatively, you can type
sessionInfo()into the console. If your R version is 4.0.0 or later, you don’t need to update R for this lesson. If your version of R is older than that, download and install the latest version of R from the R project website for Windows, for MacOS, or for LinuxIt is not necessary to remove old versions of R from your system, but if you wish to do so you can check How do I uninstall R?
Note: The changes introduced by new R versions are usually backwards-compatible. That is, your old code should still work after updating your R version. However, if breaking changes happen, it is useful to know that you can have multiple versions of R installed in parallel and that you can switch between them in RStudio by going to
Tools > Global Options > General > Basic.After installing a new version of R, you will have to reinstall all your packages with the new version. For Windows, there is a package called
installrthat can help you with upgrading your R version and migrate your package library.
To update RStudio to the latest version, open RStudio and click on Help > Check for Updates. If a new version is available follow the instruction on screen. By default, RStudio will also automatically notify you of new versions every once in a while.
R vs RStudio
RProjects
A project is simply a working directory designated with a .RProj file. When you open a project (using File/Open Project in RStudio or by double–clicking on the .Rproj file outside of R), the working directory will automatically be set to the directory that the .RProj file is located in.
I recommend creating a new R Project whenever you are starting a new research project. Once you’ve created a new R project, you should immediately create folders in the directory which will contain your R code, data files, notes, and other material relevant to your project (you can do this outside of R on your computer, or in the Files window of RStudio). For example, you could create a folder called R that contains all of your R code, a folder called data that contains all your data (etc.).
Creating an R project for each project you are working on facilitates organisation and scientific reproducibility.
An RStudio project allows you to more easily: 1. Save data, files, variables, packages, etc. related to a specific analysis project 2. Restart work where you left off 3. Collaborate, especially if you are using version control such as git
Let’s practice making an RProject to organise our code and data for the workshop today.
In RStudio click File–>New Project–>New Directory–> New Project.
Designate the name of your .Rproj and where you want this .RProj to live. This will become the location of your working directory. You should also see the name of your .RProj appear int he top right-hand conor of the RStudio IDE.
RMarkdown
R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both
save and execute code
generate high quality reports that can be shared with an audience
R Markdown documents are fully reproducible and support dozens of static and dynamic output formats.
Here is a general R Markdown Workflow:
- Open a new .Rmd file in the RSutdio IDE by going ot File –>New File –> R Markdwon
- Embed code in chunks. Run code by line, by chunk, or all at once
- Write text and add tables, figures, images and citations. Format with Markdown syntax or the RStudio Visual Markdown Editor
- Set output format(s) and options in the YAML header. Customise themes or add parameters to execute or add interactivity with Shiny
- Save and render the whole document. Knit periodically to preview your work as you write
- Share your work
Let’s give it a go by creating an .Rmd file and look through some of the features together.
OK, first off: by opening a file, we are seeing the 4th pane of the RStudio console, which here is a text editor. This lets us dock and organize our files within RStudio instead of having a bunch of different windows open (but there are options to pop them out if that is what you prefer).
Let’s have a look at this file — it’s not blank; there is some initial text already provided for you. Let’s have a high-level look through of it:
The top part has the Title and Author we provided, as well as today’s date and the output type as an HTML document like we selected above.
There are white and grey sections. These are the 2 main languages that make up an RMarkdown file.
Grey sections are R code
White sections are Markdown text
There are several advantages to using RMarkdown, one of the most useful may be the ability to keep your code, notes, and relevant links all in one place. If you are interested you can find more information on Rmarkdown online here and a cheat sheet here.
R syntax and data types
Alright - now that we have our RProject and RMarkdown file ready to go, let’s dive into the syntax and data types in R.
Working directory
You may have heard the term “working directory” a few times now. The working directory where R will look for reading and writing your files. Since we are using RProjects, the working directory will be set to this location. There are also a few packages that can help with reproducibility and setting the working directory which we will discuss below. If you are ever curious where your working directory is set you can use the getwd() function to check.
Installing and loading packages
To take full advantage of R, you need to install R packages. R packages are loadable extensions that contain code, data, documentation, and tests in a standardized, share-able format that can easily be installed by R users.
Let’s install some packages now. You should only need to install a package once on your machine.
Once a package is install we need to load and attach the packages we would like to use in our script. Once a package is loaded using library(), you can access its functions and datasets directly without having to explicitly refer to the package name each time. This is because the functions and datasets from the package are attached to the search path of the R session.
library(here)here() starts at /Users/s2995128/Documents/GitHub/bush-heritage-R
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Functions
Like Excel, the power of R comes not from doing small operations individually (like 8*22.3). R’s power comes from being able to operate on whole suites of numbers and datasets.
And also like Excel, some of the biggest power in R is that there are built-in functions that you can use in your analyses (and, as we’ll see, R users can easily create and share functions, and it is this open source developer and contributor community that makes R so awesome).
R has a mind-blowing collection of built-in functions that are used with the same syntax: function name with parentheses around what the function needs to do what it is supposed to do.
Functions always have the same structure: a name, parentheses, and arguments that you can specify. function_name(arguments). When we talk about function names, we use the convention function_name() (the name with empty parentheses), but in practice, we usually supply arguments to the function function_name(arguments) so that it works on some data. Let’s see a few more function examples.
Like in Excel, there is a function called “sum” to calculate a total. In R, it is spelled lowercase: sum(). (As I type in the Console, R will provide suggestions).
Another function is simply called c(); which combines values together.
So let’s create a new R code chunk. And we’ll write:
c(1, 7:9)[1] 1 7 8 9
So you can see that this combines these values all into the same place, which is called a vector here. We could also do this with a non-numeric examples, which are called “strings”:
c("Brisbane", "Sydney")[1] "Brisbane" "Sydney"
Getting help in R and online
Every function available to you should have a help page.
The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages. To access documentation for the here function that we installed earlier, for example, enter the command help(here) or help("here"), or ?here or ?"here" (i.e., the quotes are optional).
Learning R is an exercise in learning to google! There are so many places online to ask for help, and I have yet to personally run into an issue that someone else hasn’t had before. When searching for help try to be as specific as possible. Pasting error messages into google search, using the names of specific functions or packages, and adding “in R” to your searches can all help get you to a solution to your problem. Stackoverflow is a common place to find answers to R problems.
Data types in R
- Basic data types
Numeric:
(10.5, 55, 785)Integer:
(1L, 55L, 100L, where the "L" declares this as an integer)Complex:
(9+3i, where "i" is the imaginary part)Character/string:
("k", "R is exciting", "FALSE", "11.5")Logical/boolean:
(TRUE, FALSE)
- Vector:
c(1,2,3,"cat") - Matrix:
as.matrix(1:10) - Array:
array(1:3, c(2,4)) - List:
list(1,2,3) - Data-frame:
data.frame(x = 1, y = 1:10)
Data frames
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
Working with data frames
When we loaded the data into R, it got stored as an object of class tibble, which is a special kind of data frame (the difference is not important for our purposes, but you can learn more about tibbles here). Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. Data frames can be created by hand, but most commonly they are generated by functions like read_csv(); in other words, when importing spreadsheets from your hard drive or the web.
Now let’s learn more about the tidyverse, which we will be using to wrangle and manipulate our data.
Read data into R
Now the real fun begins! Let’s read some data into R!
Here we can use the read_csv() function from the tidyverse package coupled with the here() function from the here package. The here function helps with reproducibility, particularly when sharing your code with collaborators. Instead of having to set unique file paths, as long as the RProject is set up the same, the here function automatically creates the filepath. Let’s try it out.
tree_health<-read_csv(here("Data", "veg-tree-soil", "M06_071_Tree_Health_Quadrat_Analysis_Join_Parent_Child.csv"))Rows: 8363 Columns: 41
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): Parent_ID, Prj_Cde, Project_Name, Site_ID, Unlisted_site, Site_Ta...
dbl (14): OID_, Site_Longitude_GDA2020, Site_Latitude_GDA2020, Year, Circum...
lgl (1): Emergents
dttm (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Inspecting data frames
Once we read the data in, we may want to explore different attributes of the data and what the data looks like. There are several functions we can use to do this.
We can see the contents of the first few lines of the data by typing its name: tree_health. By default, this will show you as many rows and columns of the data as fit on your screen. If you wanted the first 50 rows, you could type print(tree_health, n = 50)
We can also extract the first few lines of this data using the function head():
head(tree_health)# A tibble: 6 × 41
OID_ Parent_ID Prj_Cde Project_Name Site_ID Unlisted_site Site_TargetName
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 {0EFAD49F-D0… <NA> <NA> <NA> <NA> <NA>
2 2 {1596E55D-C9… BROG Brogo Reser… BROG_0… <NA> <NA>
3 3 {1596E55D-C9… BROG Brogo Reser… BROG_0… <NA> <NA>
4 4 {1596E55D-C9… BROG Brogo Reser… BROG_0… <NA> <NA>
5 5 {1596E55D-C9… BROG Brogo Reser… BROG_0… <NA> <NA>
6 6 {1596E55D-C9… BROG Brogo Reser… BROG_0… <NA> <NA>
# ℹ 34 more variables: Site_Longitude_GDA2020 <dbl>,
# Site_Latitude_GDA2020 <dbl>, Staff_Name <chr>, Observation_Status <chr>,
# created_date_UTC <chr>, Date <dttm>, Year <dbl>, Scientific_Name <chr>,
# English_Name <chr>, Tree_Qudrat_species_not_on_list <chr>,
# Circumference <dbl>, Diameter <dbl>, Score <chr>, Browse_height <dbl>,
# Site_MGA_Zone <dbl>, Site_POINT_X <dbl>, Site_POINT_Y <dbl>,
# Site_NVIS_Maj_Veg_Subgroup <chr>, Site_PrjTargtID <chr>, …
Unlike the print() function, head() returns the extracted data. You could use it to assign the first 100 rows of surveys to an object using veg_sample <- head(tree_health, 100). This can be useful if you want to try out complex computations on a subset of your data before you apply them to the whole data set. There is a similar function that lets you extract the last few lines of the data set. It is called (you might have guessed it) tail().
To open the dataset in RStudio’s Data Viewer, use the view() function:
#View(tree_health)We can see what “type” of data tree_health is by calling
class(tree_health)[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
As you can see tree_health is a data.frame. There are several data types we may encounter in R. These include:
We can see this also when inspecting the structure of a data frame with the function str():
str(tree_health)spc_tbl_ [8,363 × 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ OID_ : num [1:8363] 1 2 3 4 5 6 7 8 9 10 ...
$ Parent_ID : chr [1:8363] "{0EFAD49F-D096-4DCA-8578-3AD95A55B0E6}" "{1596E55D-C9E8-4E74-938F-712403E44179}" "{1596E55D-C9E8-4E74-938F-712403E44179}" "{1596E55D-C9E8-4E74-938F-712403E44179}" ...
$ Prj_Cde : chr [1:8363] NA "BROG" "BROG" "BROG" ...
$ Project_Name : chr [1:8363] NA "Brogo Reserve" "Brogo Reserve" "Brogo Reserve" ...
$ Site_ID : chr [1:8363] NA "BROG_001" "BROG_001" "BROG_001" ...
$ Unlisted_site : chr [1:8363] NA NA NA NA ...
$ Site_TargetName : chr [1:8363] NA NA NA NA ...
$ Site_Longitude_GDA2020 : num [1:8363] NA 150 150 150 150 ...
$ Site_Latitude_GDA2020 : num [1:8363] NA -36.5 -36.5 -36.5 -36.5 ...
$ Staff_Name : chr [1:8363] NA "Donna Belder" "Donna Belder" "Donna Belder" ...
$ Observation_Status : chr [1:8363] "No observation recorded. Assumed no observable data." "Observation recorded" "Observation recorded" "Observation recorded" ...
$ created_date_UTC : chr [1:8363] "13/08/2021 5:58:41" "10/11/2022 8:27:26" "10/11/2022 8:27:26" "10/11/2022 8:27:26" ...
$ Date : POSIXct[1:8363], format: NA "2022-11-08 11:55:00" ...
$ Year : num [1:8363] NA 2022 2022 2022 2022 ...
$ Scientific_Name : chr [1:8363] NA "Casuarina cunninghamiana" "Casuarina cunninghamiana" "Casuarina cunninghamiana" ...
$ English_Name : chr [1:8363] NA "Creek Oak, Fire Oak, River Oak" "Creek Oak, Fire Oak, River Oak" "Creek Oak, Fire Oak, River Oak" ...
$ Tree_Qudrat_species_not_on_list: chr [1:8363] NA NA NA NA ...
$ Circumference : num [1:8363] NA 40 119 101 80 91 63 72 69 48 ...
$ Diameter : num [1:8363] NA 12.7 37.9 32.1 25.5 ...
$ Score : chr [1:8363] NA "None or under 10%" "10-25%" "25-50%" ...
$ Browse_height : num [1:8363] NA NA NA NA NA NA NA NA NA NA ...
$ Site_MGA_Zone : num [1:8363] NA 55 55 55 55 55 55 55 55 55 ...
$ Site_POINT_X : num [1:8363] NA 749784 749784 749784 749784 ...
$ Site_POINT_Y : num [1:8363] NA 5956796 5956796 5956796 5956796 ...
$ Site_NVIS_Maj_Veg_Subgroup : chr [1:8363] NA "Eucalyptus woodlands with a tussock grass understorey" "Eucalyptus woodlands with a tussock grass understorey" "Eucalyptus woodlands with a tussock grass understorey" ...
$ Site_PrjTargtID : chr [1:8363] NA NA NA NA ...
$ Site_TargetID : chr [1:8363] NA NA NA NA ...
$ Comments : chr [1:8363] NA "Quadrat approximately 20X9 due to Brogo River" "Quadrat approximately 20X9 due to Brogo River" "Quadrat approximately 20X9 due to Brogo River" ...
$ Emergents : logi [1:8363] NA NA NA NA NA NA ...
$ Length_m : num [1:8363] 50 50 50 50 50 50 50 50 50 50 ...
$ Tree_Health_Quadrat_dimensions : chr [1:8363] "20m x 100m" "20m x 20m" "20m x 20m" "20m x 20m" ...
$ Tree_Health_Quadrat_location : chr [1:8363] "Right" "Left" "Left" "Left" ...
$ UTC_offset : num [1:8363] NA 11 11 11 11 11 11 11 11 11 ...
$ GlobalID : chr [1:8363] NA "{BC024DBA-5DF1-4707-BFBE-236587DF0422}" "{10BB9073-092F-4D8E-BF5F-37170B2FE467}" "{CEC8A6ED-59A4-4DF4-9988-3E4EF5F33BF2}" ...
$ Child_method_Tree_Health_Quad : chr [1:8363] NA "M06.071" "M06.071" "M06.071" ...
$ created_user : chr [1:8363] "justin.mccann@bushheritage.org.au" "Donna.Belder@bushheritage.org.au" "Donna.Belder@bushheritage.org.au" "Donna.Belder@bushheritage.org.au" ...
$ last_edited_user : chr [1:8363] "justin.mccann@bushheritage.org.au" "Donna.Belder@bushheritage.org.au" "Donna.Belder@bushheritage.org.au" "Donna.Belder@bushheritage.org.au" ...
$ last_edited_date_UTC : chr [1:8363] "13/08/2021 5:58:41" "10/11/2022 8:27:26" "10/11/2022 8:27:26" "10/11/2022 8:27:26" ...
$ Device_Longitude_GDA2020 : num [1:8363] 145 150 150 150 150 ...
$ Device_Latitude_GDA2020 : num [1:8363] -37.8 -36.5 -36.5 -36.5 -36.5 ...
$ Survey123_Edit_Link : chr [1:8363] "<a href=\"\"arcgis-survey123://?itemID=7e4c83dac4a2492199e311586946c2b3&folder=inbox&filter=GlobalID:0EFAD49F-D"| __truncated__ "<a href=\"\"arcgis-survey123://?itemID=7e4c83dac4a2492199e311586946c2b3&folder=inbox&filter=GlobalID:1596E55D-C"| __truncated__ "<a href=\"\"arcgis-survey123://?itemID=7e4c83dac4a2492199e311586946c2b3&folder=inbox&filter=GlobalID:1596E55D-C"| __truncated__ "<a href=\"\"arcgis-survey123://?itemID=7e4c83dac4a2492199e311586946c2b3&folder=inbox&filter=GlobalID:1596E55D-C"| __truncated__ ...
- attr(*, "spec")=
.. cols(
.. OID_ = col_double(),
.. Parent_ID = col_character(),
.. Prj_Cde = col_character(),
.. Project_Name = col_character(),
.. Site_ID = col_character(),
.. Unlisted_site = col_character(),
.. Site_TargetName = col_character(),
.. Site_Longitude_GDA2020 = col_double(),
.. Site_Latitude_GDA2020 = col_double(),
.. Staff_Name = col_character(),
.. Observation_Status = col_character(),
.. created_date_UTC = col_character(),
.. Date = col_datetime(format = ""),
.. Year = col_double(),
.. Scientific_Name = col_character(),
.. English_Name = col_character(),
.. Tree_Qudrat_species_not_on_list = col_character(),
.. Circumference = col_double(),
.. Diameter = col_double(),
.. Score = col_character(),
.. Browse_height = col_double(),
.. Site_MGA_Zone = col_double(),
.. Site_POINT_X = col_double(),
.. Site_POINT_Y = col_double(),
.. Site_NVIS_Maj_Veg_Subgroup = col_character(),
.. Site_PrjTargtID = col_character(),
.. Site_TargetID = col_character(),
.. Comments = col_character(),
.. Emergents = col_logical(),
.. Length_m = col_double(),
.. Tree_Health_Quadrat_dimensions = col_character(),
.. Tree_Health_Quadrat_location = col_character(),
.. UTC_offset = col_double(),
.. GlobalID = col_character(),
.. Child_method_Tree_Health_Quad = col_character(),
.. created_user = col_character(),
.. last_edited_user = col_character(),
.. last_edited_date_UTC = col_character(),
.. Device_Longitude_GDA2020 = col_double(),
.. Device_Latitude_GDA2020 = col_double(),
.. Survey123_Edit_Link = col_character()
.. )
- attr(*, "problems")=<externalptr>
We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!
Size:
dim(tree_health)- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow(tree_health)- returns the number of rowsncol(tree_health)- returns the number of columns
Content:
head(tree_health)- shows the first 6 rowstail(tree_health)- shows the last 6 rows
Names:
names(tree_health)- returns the column names (synonym ofcolnames()fordata.frameobjects)rownames(tree_health)- returns the row names
Summary:
str(tree_health)- structure of the object and information about the class, length and content of each columnsummary(tree_health)- summary statistics for each column
Subsetting data frames
Our tree health data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
We can extract specific values by specifying row and column indices in the format:
tree_health[1,1]# A tibble: 1 × 1
OID_
<dbl>
1 1
First row, sixth column:
tree_health[5,15]# A tibble: 1 × 1
Scientific_Name
<chr>
1 Casuarina cunninghamiana
We can also use shortcuts to select a number of rows or columns at once. To select all columns, leave the column index blank. For instance, to select all columns for the first row:
tree_health[1,]# A tibble: 1 × 41
OID_ Parent_ID Prj_Cde Project_Name Site_ID Unlisted_site Site_TargetName
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 {0EFAD49F-D0… <NA> <NA> <NA> <NA> <NA>
# ℹ 34 more variables: Site_Longitude_GDA2020 <dbl>,
# Site_Latitude_GDA2020 <dbl>, Staff_Name <chr>, Observation_Status <chr>,
# created_date_UTC <chr>, Date <dttm>, Year <dbl>, Scientific_Name <chr>,
# English_Name <chr>, Tree_Qudrat_species_not_on_list <chr>,
# Circumference <dbl>, Diameter <dbl>, Score <chr>, Browse_height <dbl>,
# Site_MGA_Zone <dbl>, Site_POINT_X <dbl>, Site_POINT_Y <dbl>,
# Site_NVIS_Maj_Veg_Subgroup <chr>, Site_PrjTargtID <chr>, …
To select the first column across all rows
tree_health[,4]# A tibble: 8,363 × 1
Project_Name
<chr>
1 <NA>
2 Brogo Reserve
3 Brogo Reserve
4 Brogo Reserve
5 Brogo Reserve
6 Brogo Reserve
7 Brogo Reserve
8 Brogo Reserve
9 Brogo Reserve
10 Brogo Reserve
# ℹ 8,353 more rows
An even shorter way to select first column across all rows is to not use a comma
tree_health[4]# A tibble: 8,363 × 1
Project_Name
<chr>
1 <NA>
2 Brogo Reserve
3 Brogo Reserve
4 Brogo Reserve
5 Brogo Reserve
6 Brogo Reserve
7 Brogo Reserve
8 Brogo Reserve
9 Brogo Reserve
10 Brogo Reserve
# ℹ 8,353 more rows
To select multiple rows or columns, use vectors! To select the first three rows of the 5th and 6th column.
tree_health[1:3, 3:4]# A tibble: 3 × 2
Prj_Cde Project_Name
<chr> <chr>
1 <NA> <NA>
2 BROG Brogo Reserve
3 BROG Brogo Reserve
This is equivalent to head_surveys <- head(surveys)
head_tree_health<-tree_health[1:6]As we’ve seen, when working with tibbles subsetting with single square brackets (“[]”) always returns a data frame. If you want a vector, use double square brackets (“[[]]”). For instance to get the third column as a vector:
tree_health[[3]]To get the first value in our data frame
tree_health[[1,1]][1] 1
Some other useful functions:
colnames(tree_health) [1] "OID_" "Parent_ID"
[3] "Prj_Cde" "Project_Name"
[5] "Site_ID" "Unlisted_site"
[7] "Site_TargetName" "Site_Longitude_GDA2020"
[9] "Site_Latitude_GDA2020" "Staff_Name"
[11] "Observation_Status" "created_date_UTC"
[13] "Date" "Year"
[15] "Scientific_Name" "English_Name"
[17] "Tree_Qudrat_species_not_on_list" "Circumference"
[19] "Diameter" "Score"
[21] "Browse_height" "Site_MGA_Zone"
[23] "Site_POINT_X" "Site_POINT_Y"
[25] "Site_NVIS_Maj_Veg_Subgroup" "Site_PrjTargtID"
[27] "Site_TargetID" "Comments"
[29] "Emergents" "Length_m"
[31] "Tree_Health_Quadrat_dimensions" "Tree_Health_Quadrat_location"
[33] "UTC_offset" "GlobalID"
[35] "Child_method_Tree_Health_Quad" "created_user"
[37] "last_edited_user" "last_edited_date_UTC"
[39] "Device_Longitude_GDA2020" "Device_Latitude_GDA2020"
[41] "Survey123_Edit_Link"
unique(tree_health$Project_Name) [1] NA
[2] "Brogo Reserve"
[3] "Burrin Burrin"
[4] "Liffey Valley Reserves and Landscape"
[5] "Dalyenong Reserves"
[6] "Charles Darwin Reserve"
[7] "Edgbaston"
[8] "Eurardy Reserve"
[9] "Fitz-Stirling (BHA Partnership)"
[10] "Hamelin Station Reserve"
[11] "Scottsdale Reserve - the traditional lands of the Ngunawal people"
[12] "Tarcutta Hills Reserve"
unique(tree_health$Year)[1] NA 2022 2023 2021
The above examples were using column and row indexing to filter rows, however this approach can run into problems for reproducibility and transparency in your data wrangling. While useful to know how R handles rows and columns in this way, there are more intuitive ways to subset your data in R using the tidyverse.
Introduction to the tidyverse
The tidyverse package is an “umbrella-package” that installs tidyr, dplyr, and several other useful packages for data analysis, such as ggplot2, tibble, etc.
The tidyverse package tries to address 3 common issues that arise when doing data analysis in R:
The results from a base R function sometimes depend on the type of data.
R expressions are used in a non standard way, which can be confusing for new learners.
The existence of hidden arguments having default operations that new learners are not aware of.
You should already have installed and loaded the tidyverse package. If you haven’t already done so, you can type install.packages("tidyverse") straight into the console. Then, type library(tidyverse) to load the package.
The suite of packages in the tidyverse follow the same principles and rely on a tidy data format. Tidy data is a standard way of mapping the meaning of a data to its structure as seen below. This format may be different than you may have previously used. We will explore how to transform data into a tidy format later in this course.
dypr and tidyr
The package dplyr provides helper tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
The package tidyr addresses the common problem of wanting to reshape your data for plotting and usage by different R functions. For example, sometimes we want data sets where we have one row per measurement. Other times we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups (e.g., a time period, an experimental unit like a plot or a batch number). Moving back and forth between these formats is non-trivial, and tidyr gives you tools for this and more sophisticated data manipulation.
To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.Importing data into R
Wrangling data with the tidyverse
To select columns of a data frame, use select().
sub_tree_health<- select(tree_health, Project_Name, Site_ID, Scientific_Name, English_Name, Circumference, Diameter, Date)head(sub_tree_health)# A tibble: 6 × 7
Project_Name Site_ID Scientific_Name English_Name Circumference Diameter
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 <NA> <NA> <NA> <NA> NA NA
2 Brogo Reserve BROG_001 Casuarina cunningh… Creek Oak, … 40 12.7
3 Brogo Reserve BROG_001 Casuarina cunningh… Creek Oak, … 119 37.9
4 Brogo Reserve BROG_001 Casuarina cunningh… Creek Oak, … 101 32.1
5 Brogo Reserve BROG_001 Casuarina cunningh… Creek Oak, … 80 25.5
6 Brogo Reserve BROG_001 Casuarina cunningh… Creek Oak, … 91 29.0
# ℹ 1 more variable: Date <dttm>
colnames(sub_tree_health)[1] "Project_Name" "Site_ID" "Scientific_Name" "English_Name"
[5] "Circumference" "Diameter" "Date"
To select all columns except certain ones, put a “-” in front of the variable to exclude it.
sub_tree_health<- select(tree_health, -OID_, -Parent_ID, -Site_Longitude_GDA2020)head(sub_tree_health)# A tibble: 6 × 38
Prj_Cde Project_Name Site_ID Unlisted_site Site_TargetName
<chr> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA>
2 BROG Brogo Reserve BROG_001 <NA> <NA>
3 BROG Brogo Reserve BROG_001 <NA> <NA>
4 BROG Brogo Reserve BROG_001 <NA> <NA>
5 BROG Brogo Reserve BROG_001 <NA> <NA>
6 BROG Brogo Reserve BROG_001 <NA> <NA>
# ℹ 33 more variables: Site_Latitude_GDA2020 <dbl>, Staff_Name <chr>,
# Observation_Status <chr>, created_date_UTC <chr>, Date <dttm>, Year <dbl>,
# Scientific_Name <chr>, English_Name <chr>,
# Tree_Qudrat_species_not_on_list <chr>, Circumference <dbl>, Diameter <dbl>,
# Score <chr>, Browse_height <dbl>, Site_MGA_Zone <dbl>, Site_POINT_X <dbl>,
# Site_POINT_Y <dbl>, Site_NVIS_Maj_Veg_Subgroup <chr>,
# Site_PrjTargtID <chr>, Site_TargetID <chr>, Comments <chr>, …
colnames(sub_tree_health) [1] "Prj_Cde" "Project_Name"
[3] "Site_ID" "Unlisted_site"
[5] "Site_TargetName" "Site_Latitude_GDA2020"
[7] "Staff_Name" "Observation_Status"
[9] "created_date_UTC" "Date"
[11] "Year" "Scientific_Name"
[13] "English_Name" "Tree_Qudrat_species_not_on_list"
[15] "Circumference" "Diameter"
[17] "Score" "Browse_height"
[19] "Site_MGA_Zone" "Site_POINT_X"
[21] "Site_POINT_Y" "Site_NVIS_Maj_Veg_Subgroup"
[23] "Site_PrjTargtID" "Site_TargetID"
[25] "Comments" "Emergents"
[27] "Length_m" "Tree_Health_Quadrat_dimensions"
[29] "Tree_Health_Quadrat_location" "UTC_offset"
[31] "GlobalID" "Child_method_Tree_Health_Quad"
[33] "created_user" "last_edited_user"
[35] "last_edited_date_UTC" "Device_Longitude_GDA2020"
[37] "Device_Latitude_GDA2020" "Survey123_Edit_Link"
To choose rows based on a specific criterion, use filter():
small_trees<-filter(tree_health, Circumference <40)
large_trees<-filter(tree_health, Circumference >=40)survey2022<-filter(tree_health, Year == 2022)Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
sub_tree_health<-select(tree_health, Project_Name, Site_ID, Scientific_Name, English_Name, Circumference, Date)
small_trees<- filter(sub_tree_health, Circumference <40)This works, but the workflow is a bit clunky since we create data frames we do not necessary need. Another option is to use nested functions like this:
tree_health_sml <- select(filter(tree_health, Circumference < 40), Project_Name, Site_ID, Scientific_Name, English_Name, Circumference, Date)Again, this works! But it can be a bit hard to interpret and follow the steps between functions. Another option is using what we call pipes.
To make R code more human readable, the Tidyverse tools use the pipe, %>%, which was acquired from the magrittr package and is now part of the dplyr package that is installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.
Using pipes, selecting columns and filtering rows from our data looks like this:
small_trees<-tree_health %>%
filter(Circumference< 40) %>%
select(Project_Name, Site_ID, Scientific_Name, English_Name, Circumference, Date)Pretty schnazzy huh? The pipe represents a much easier way of writing and deciphering R code.
CHALLENGE
Create a data frame called large_trees that contains the same columns as the small_trees data frame and the trees with a circumference greater than or equal to 40.
large_trees<-tree_health %>%
filter(Circumference>= 40) %>%
select(Project_Name, Site_ID, Scientific_Name, English_Name, Circumference, Date)Manipulating data
Creating new columns
Just as we may want to remove certain columns from our dataset, we may also want to create new columns based on other rows or information.
Here let’s add a column to classify our tress as either “Small” or “Large” based on the Circumference column.
There are several ways to do this including an ifelse statement or using the case_when function. Let’s try both!
sub_tree_health_cat <- sub_tree_health %>%
mutate(Size_group = ifelse(Circumference <40, "Small", "Large"))OR
sub_tree_health_cat <- sub_tree_health %>%
mutate(Size_group = ifelse(Circumference >=40, "Large", "Small"))We always want to check our work and make sure that R produces what we expect.
head(sub_tree_health_cat)# A tibble: 6 × 7
Project_Name Site_ID Scientific_Name English_Name Circumference
<chr> <chr> <chr> <chr> <dbl>
1 <NA> <NA> <NA> <NA> NA
2 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 40
3 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 119
4 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 101
5 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 80
6 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 91
# ℹ 2 more variables: Date <dttm>, Size_group <chr>
Now with the case_when function:
sub_tree_health_cat2 <- sub_tree_health %>%
mutate(Size_group = case_when(Circumference <40 ~ "Small",
TRUE ~ "Large"))head(sub_tree_health_cat2)# A tibble: 6 × 7
Project_Name Site_ID Scientific_Name English_Name Circumference
<chr> <chr> <chr> <chr> <dbl>
1 <NA> <NA> <NA> <NA> NA
2 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 40
3 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 119
4 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 101
5 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 80
6 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 91
# ℹ 2 more variables: Date <dttm>, Size_group <chr>
Did it work as expected? Not quite! Trees with no circumference data get classified as “Large”! That’s not right. Good thing we checked, now let’s fix it:
sub_tree_health_cat2 <- sub_tree_health %>%
mutate(Size_group = case_when(Circumference <40 ~ "Small",
Circumference >=40 ~ "Large",
TRUE ~ NA))head(sub_tree_health_cat2)# A tibble: 6 × 7
Project_Name Site_ID Scientific_Name English_Name Circumference
<chr> <chr> <chr> <chr> <dbl>
1 <NA> <NA> <NA> <NA> NA
2 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 40
3 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 119
4 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 101
5 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 80
6 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 91
# ℹ 2 more variables: Date <dttm>, Size_group <chr>
There we go :) Now trees without circumferences values are not grouped into a category but instead also have an NA.
As I’m sure most of you know - missing data is common. Let’s explore how to deal with missing data in R.
Missing data
As you could see previously, some values in our data frame do not have a measurement for circumference. Many datasets will have NA values for important variables and it is important to handle missing data in the appropriate way for the question you aim to ask.
Case #1- NA is it’s own category
For the example above, perhaps we want to keep the NA values but have them be their own category. In this case, we would need to create a new category name in our Size_group column. As usual, there are multiple ways to do this.
sub_tree_health_cat<-sub_tree_health_cat
sub_tree_health_cat$Size_group[is.na(sub_tree_health_cat$Size_group) == TRUE]<-"Unknown"head(sub_tree_health_cat)# A tibble: 6 × 7
Project_Name Site_ID Scientific_Name English_Name Circumference
<chr> <chr> <chr> <chr> <dbl>
1 <NA> <NA> <NA> <NA> NA
2 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 40
3 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 119
4 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 101
5 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 80
6 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 91
# ℹ 2 more variables: Date <dttm>, Size_group <chr>
This works - but it makes it look complicated! Let’s try with the tidyverse.
sub_tree_health_cat<-sub_tree_health_cat %>%
mutate(Size_group = replace_na(Size_group, "Unknown"))head(sub_tree_health_cat)# A tibble: 6 × 7
Project_Name Site_ID Scientific_Name English_Name Circumference
<chr> <chr> <chr> <chr> <dbl>
1 <NA> <NA> <NA> <NA> NA
2 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 40
3 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 119
4 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 101
5 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 80
6 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 91
# ℹ 2 more variables: Date <dttm>, Size_group <chr>
That’s a bit better!
We could have also accounted from NA’s from the beginning in our case_when function like this:
sub_tree_health_cat <- sub_tree_health %>%
mutate(Size_group = case_when(Circumference <40 ~ "Small",
Circumference >=40 ~ "Large",
TRUE ~ "Unknown"))head(sub_tree_health_cat)# A tibble: 6 × 7
Project_Name Site_ID Scientific_Name English_Name Circumference
<chr> <chr> <chr> <chr> <dbl>
1 <NA> <NA> <NA> <NA> NA
2 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 40
3 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 119
4 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 101
5 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 80
6 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 91
# ℹ 2 more variables: Date <dttm>, Size_group <chr>
Perhaps the NA values should be removed from the analysis, which we could do in several ways:
sub_tree_health_narm<-sub_tree_health[complete.cases(sub_tree_health$Circumference),]head(sub_tree_health_narm)# A tibble: 6 × 6
Project_Name Site_ID Scientific_Name English_Name Circumference
<chr> <chr> <chr> <chr> <dbl>
1 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 40
2 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 119
3 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 101
4 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 80
5 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 91
6 Brogo Reserve BROG_001 Casuarina cunninghamiana Creek Oak, Fire… 63
# ℹ 1 more variable: Date <dttm>
length(which(is.na(sub_tree_health_narm) ==T))[1] 2096
sub_tree_health_narm<-sub_tree_health %>%
filter(is.na(Circumference != TRUE))OR
sub_tree_health_narm<-sub_tree_health %>%
filter(is.na(Circumference == FALSE))OR
sub_tree_health_narm<-sub_tree_health %>%
filter(!is.na(Circumference))CHALLENGE
There are some NA values for Scientific Names and Project Names. Remove rows without information on the Scientific Name.
sub_tree_health_narm<-sub_tree_health %>%
filter(!is.na(Scientific_Name),
!is.na(Project_Name))The summarise function
Now that we have done some data tidying, let’s start summarising some of our data using the summarise function in tidyverse
group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean Circumference by Site:
sub_tree_health_narm %>%
group_by(Project_Name) %>%
summarise(mean_cir = mean(Circumference, na.rm = TRUE))# A tibble: 11 × 2
Project_Name mean_cir
<chr> <dbl>
1 Brogo Reserve 72.0
2 Burrin Burrin 80.8
3 Charles Darwin Reserve 63.7
4 Dalyenong Reserves 70.5
5 Edgbaston 63.3
6 Eurardy Reserve 63.9
7 Fitz-Stirling (BHA Partnership) 27.6
8 Hamelin Station Reserve 39.7
9 Liffey Valley Reserves and Landscape 59
10 Scottsdale Reserve - the traditional lands of the Ngunawal people 80.0
11 Tarcutta Hills Reserve 37.1
I noticed that several Circumferences are listed as 0, maybe we want to remove these before we calculate Circumference
sub_tree_health_narm %>%
filter(Circumference>0) %>%
group_by(Project_Name) %>%
summarise(mean_cir = mean(Circumference, na.rm = TRUE))# A tibble: 11 × 2
Project_Name mean_cir
<chr> <dbl>
1 Brogo Reserve 72.0
2 Burrin Burrin 80.8
3 Charles Darwin Reserve 64.2
4 Dalyenong Reserves 70.5
5 Edgbaston 63.3
6 Eurardy Reserve 63.9
7 Fitz-Stirling (BHA Partnership) 27.6
8 Hamelin Station Reserve 39.7
9 Liffey Valley Reserves and Landscape 59
10 Scottsdale Reserve - the traditional lands of the Ngunawal people 80.0
11 Tarcutta Hills Reserve 37.1
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum circumference for each species for each Site:
sub_tree_health_narm %>%
filter(Circumference>0) %>%
group_by(Project_Name, Scientific_Name) %>%
summarise(mean_cir = mean(Circumference, na.rm = TRUE),
min_cir = min(Circumference, na.rm = TRUE))`summarise()` has grouped output by 'Project_Name'. You can override using the
`.groups` argument.
# A tibble: 125 × 4
# Groups: Project_Name [11]
Project_Name Scientific_Name mean_cir min_cir
<chr> <chr> <dbl> <dbl>
1 Brogo Reserve Acacia implexa 77.6 46
2 Brogo Reserve Acacia mearnsii 54.1 36
3 Brogo Reserve Acronychia oblongifolia 43.0 32
4 Brogo Reserve Angophora floribunda 93 93
5 Brogo Reserve Casuarina cunninghamiana 69.9 35
6 Brogo Reserve Eucalyptus bosistoana 51.9 34.5
7 Brogo Reserve Eucalyptus globoidea 79.7 32
8 Brogo Reserve Eucalyptus globulus 38 38
9 Brogo Reserve Eucalyptus maidenii 180. 71
10 Brogo Reserve Eucalyptus tereticornis 132. 35
# ℹ 115 more rows
It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min_circ to put the smaller species first:
sub_tree_health_narm %>%
filter(Circumference>0) %>%
group_by(Project_Name, Scientific_Name) %>%
summarise(mean_cir = mean(Circumference, na.rm = TRUE),
min_cir = min(Circumference, na.rm = TRUE)) %>%
arrange(min_cir)`summarise()` has grouped output by 'Project_Name'. You can override using the
`.groups` argument.
# A tibble: 125 × 4
# Groups: Project_Name [11]
Project_Name Scientific_Name mean_cir min_cir
<chr> <chr> <dbl> <dbl>
1 Tarcutta Hills Reserve Eucalyptus albens 59.6 1
2 Tarcutta Hills Reserve Eucalyptus rossii 23.9 1
3 Tarcutta Hills Reserve Eucalyptus sideroxylon 71.3 1
4 Dalyenong Reserves Eucalyptus goniocalyx 66.4 1.04
5 Charles Darwin Reserve Eucalyptus salubris 45.8 1.22
6 Tarcutta Hills Reserve Eucalyptus blakelyi 45.3 2
7 Tarcutta Hills Reserve Eucalyptus sp. 34.6 2
8 Edgbaston Acacia cambagei 101. 3.6
9 Tarcutta Hills Reserve Eucalyptus polyanthemos 42.8 4
10 Tarcutta Hills Reserve Eucalyptus macrorhyncha 37.1 4.5
# ℹ 115 more rows
To sort in descending order, we need to add the desc() function. If we want to sort the results by decreasing order of mean circumference:
sub_tree_health_narm %>%
filter(Circumference>0) %>%
group_by(Project_Name, Scientific_Name) %>%
summarise(mean_cir = mean(Circumference, na.rm = TRUE),
min_cir = min(Circumference, na.rm = TRUE)) %>%
arrange(desc(mean_cir))`summarise()` has grouped output by 'Project_Name'. You can override using the
`.groups` argument.
# A tibble: 125 × 4
# Groups: Project_Name [11]
Project_Name Scientific_Name mean_cir min_cir
<chr> <chr> <dbl> <dbl>
1 Edgbaston Acacia cochloc… 361 361
2 Burrin Burrin Eucalyptus fas… 193. 92
3 Charles Darwin Reserve Eucalyptus cle… 184 184
4 Brogo Reserve Eucalyptus mai… 180. 71
5 Charles Darwin Reserve Eucalyptus sal… 156. 114
6 Burrin Burrin Eucalyptus vim… 150. 96
7 Tarcutta Hills Reserve Eucalyptus mel… 147 104
8 Scottsdale Reserve - the traditional lands … Eucalyptus mel… 132. 38
9 Brogo Reserve Eucalyptus ter… 132. 35
10 Eurardy Reserve Eucalyptus lox… 122. 33
# ℹ 115 more rows
CHALLENGE
You may have noticed that our previous data manipulation was not being stored anywhere! Create a new variable called sub_tree_health2 and summarise the data for mean, min, and max circumference for circumference values >0 by Site_ID, removing missing data for Project Name and Scientific Name
sub_tree_health2<-sub_tree_health_narm %>%
filter(Circumference>0,
!is.na(Project_Name),
!is.na(Scientific_Name)) %>%
group_by(Site_ID) %>%
summarise(mean_cir = mean(Circumference, na.rm =T),
min_cir = min(Circumference, na.rm = T),
max_cir = max(Circumference, na.rm = T))Counts
When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each species, we would do:
sub_tree_health_narm %>%
count(Scientific_Name)# A tibble: 107 × 2
Scientific_Name n
<chr> <int>
1 Acacia acanthoclada 1
2 Acacia anthochaera 2
3 Acacia argyrodendron 18
4 Acacia cambagei 19
5 Acacia cochlocarpa 1
6 Acacia dealbata 1
7 Acacia effusifolia 3
8 Acacia excelsa 2
9 Acacia implexa 7
10 Acacia mearnsii 22
# ℹ 97 more rows
If we want to count how many surveys were done in each project site we would enter:
sub_tree_health_narm %>%
count(Site_ID)# A tibble: 129 × 2
Site_ID n
<chr> <int>
1 BROG_001 28
2 BROG_002 27
3 BROG_003 25
4 BROG_004 39
5 BROG_006 32
6 BROG_007 1
7 BROG_008 10
8 BURR_001 11
9 BURR_002 12
10 BURR_003 9
# ℹ 119 more rows
The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, surveys %>% count() is equivalent to:
sub_tree_health_narm %>%
group_by(Site_ID) %>%
summarise(count = n())# A tibble: 129 × 2
Site_ID count
<chr> <int>
1 BROG_001 28
2 BROG_002 27
3 BROG_003 25
4 BROG_004 39
5 BROG_006 32
6 BROG_007 1
7 BROG_008 10
8 BURR_001 11
9 BURR_002 12
10 BURR_003 9
# ℹ 119 more rows
For convenience, count() provides the sort argument:
sub_tree_health_narm %>%
count(Scientific_Name, sort = TRUE)# A tibble: 107 × 2
Scientific_Name n
<chr> <int>
1 Eucalyptus macrorhyncha 3314
2 Eucalyptus rossii 1376
3 Eucalyptus sp. 521
4 Eucalyptus sideroxylon 347
5 Eucalyptus polyanthemos 279
6 Eucalyptus goniocalyx 262
7 Eucalyptus blakelyi 250
8 Eucalyptus albens 154
9 Eucalyptus camaldulensis 96
10 Eucalyptus astringens 73
# ℹ 97 more rows
Previous example shows the use of count() to count the number of rows/observations for one factor (i.e., Scientific_Name). If we wanted to count combination of factors, such as Site_ID and Scientific_Name, we would specify the first and the second factor as the arguments of count():
sub_tree_health_narm %>%
count(Site_ID, Scientific_Name)# A tibble: 339 × 3
Site_ID Scientific_Name n
<chr> <chr> <int>
1 BROG_001 Casuarina cunninghamiana 28
2 BROG_002 Eucalyptus globoidea 26
3 BROG_002 Eucalyptus tereticornis 1
4 BROG_003 Acacia mearnsii 1
5 BROG_003 Eucalyptus globoidea 21
6 BROG_003 Eucalyptus tereticornis 3
7 BROG_004 Acronychia oblongifolia 35
8 BROG_004 Eucalyptus globoidea 1
9 BROG_004 Eucalyptus maidenii 1
10 BROG_004 Eucalyptus tereticornis 2
# ℹ 329 more rows
With the above code, we can proceed with arrange() to sort the table according to a number of criteria so that we have a better comparison. For instance, we might want to arrange the table above in (i) an alphabetical order of the levels of the species and (ii) in descending order of the count:
sub_tree_health_narm %>%
count(Site_ID, Scientific_Name) %>%
arrange(Scientific_Name, desc(n))# A tibble: 339 × 3
Site_ID Scientific_Name n
<chr> <chr> <int>
1 CDR_063 Acacia acanthoclada 1
2 CDR_023 Acacia anthochaera 1
3 CDR_024 Acacia anthochaera 1
4 EDGB_010 Acacia argyrodendron 18
5 EDGB_054 Acacia cambagei 8
6 EDGB_008 Acacia cambagei 7
7 EDGB_044 Acacia cambagei 4
8 EDGB_007 Acacia cochlocarpa 1
9 LIFR_001 Acacia dealbata 1
10 CDR_063 Acacia effusifolia 2
# ℹ 329 more rows
CHALLENGE
How many trees are in the “Large” and “Small” circumference classes we created earlier in the tutorial?
count_size<-sub_tree_health_cat %>%
count(Size_group)count_size# A tibble: 3 × 2
Size_group n
<chr> <int>
1 Large 3399
2 Small 4942
3 Unknown 22
CHALLENGE 2
Which site has the largest circumference tree measured within each Project? Return the columns Project_Name, Site_ID, Scientific_Name, and Circumference.
max_circumference<-sub_tree_health_narm %>%
filter(Circumference>0) %>%
group_by(Project_Name) %>%
filter(Circumference == max(Circumference)) %>%
select(Project_Name, Site_ID, Scientific_Name, Circumference) view(max_circumference)Dealing with dates
A common issue that new (and experienced!) R users have is converting date and time information into a variable that is suitable for analyses. One way to store date information is to store each component of the date in a separate column. Using str(), we can see that currently our date is stored in the YYYYMMDD format.
str(sub_tree_health_narm)tibble [7,870 × 6] (S3: tbl_df/tbl/data.frame)
$ Project_Name : chr [1:7870] "Brogo Reserve" "Brogo Reserve" "Brogo Reserve" "Brogo Reserve" ...
$ Site_ID : chr [1:7870] "BROG_001" "BROG_001" "BROG_001" "BROG_001" ...
$ Scientific_Name: chr [1:7870] "Casuarina cunninghamiana" "Casuarina cunninghamiana" "Casuarina cunninghamiana" "Casuarina cunninghamiana" ...
$ English_Name : chr [1:7870] "Creek Oak, Fire Oak, River Oak" "Creek Oak, Fire Oak, River Oak" "Creek Oak, Fire Oak, River Oak" "Creek Oak, Fire Oak, River Oak" ...
$ Circumference : num [1:7870] 40 119 101 80 91 63 72 69 48 67 ...
$ Date : POSIXct[1:7870], format: "2022-11-08 11:55:00" "2022-11-08 11:55:00" ...
Maybe we want the year, month, and day to be in separate columns. To do this we can use the lubridate package. First let’s install and load the package.
#install.packages("lubridate")library(lubridate)The lubridate package has many useful functions for working with dates. These can help you extract dates from different string representations, convert between timezones, calculate time differences and more. You can find an overview of them in the lubridate cheat sheet.
sub_tree_health_yrs<-sub_tree_health_narm %>%
mutate(Year = year(Date),
Month = month(Date),
Day = day(Date))We can use the unique() function to see how many years are in our dataset
unique(sub_tree_health_yrs$Year)[1] 2022 2023 2021
Three years! We can also look through some our summarising functions now with the date!
CHALLENGE
How many surveys were done each year?
sub_tree_health_yrs %>%
count(Year)# A tibble: 3 × 2
Year n
<dbl> <int>
1 2021 264
2 2022 3886
3 2023 3720
table(sub_tree_health_yrs$Year)
2021 2022 2023
264 3886 3720
How many sites were surveyed in each year? (hint: use the unique() function combined with the length function)
sub_tree_health_yrs %>%
group_by(Year) %>%
summarise(count = length(unique(Site_ID)))# A tibble: 3 × 2
Year count
<dbl> <int>
1 2021 15
2 2022 34
3 2023 90
How many unique species were surveyed each year?
sub_tree_health_yrs %>%
group_by(Year) %>%
summarise(count = length(unique(Scientific_Name)))# A tibble: 3 × 2
Year count
<dbl> <int>
1 2021 15
2 2022 28
3 2023 87
Reshaping/pivoting data
Early we discussed the “tidy” data format where:
Each variable has its own column
Each observation has its own row
Each value must have its own cell
Each type of observational unit forms a table
Luckily, the data we have been working with thus far has already been in a tidy format. But this may not always be the case! Let’s look at how we can reshape data betwee what we call a long vs. wide format.
Pivoting from long to wide format
pivot_wider() takes three principal arguments:
the data
the names_from column variable whose values will become new column names.
the values_from column variable whose values will fill the new column variables.
Further arguments include values_fill which, if set, fills in missing values with the value provided.
Let’s use pivot_wider() to transform our dataset to find the mean circumference of each species in each plot over the entire survey period. We use filter(), group_by() and summarize() to filter our observations and variables of interest, and create a new variable for the mean_circumference.
sub_gw <- sub_tree_health_narm %>%
filter(Circumference>0) %>%
group_by(Site_ID, Scientific_Name) %>%
summarize(mean_cir = mean(Circumference))`summarise()` has grouped output by 'Site_ID'. You can override using the
`.groups` argument.
str(sub_gw)gropd_df [339 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ Site_ID : chr [1:339] "BROG_001" "BROG_002" "BROG_002" "BROG_003" ...
$ Scientific_Name: chr [1:339] "Casuarina cunninghamiana" "Eucalyptus globoidea" "Eucalyptus tereticornis" "Acacia mearnsii" ...
$ mean_cir : num [1:339] 69.9 77.8 35 39 74.5 ...
- attr(*, "groups")= tibble [129 × 2] (S3: tbl_df/tbl/data.frame)
..$ Site_ID: chr [1:129] "BROG_001" "BROG_002" "BROG_003" "BROG_004" ...
..$ .rows : list<int> [1:129]
.. ..$ : int 1
.. ..$ : int [1:2] 2 3
.. ..$ : int [1:3] 4 5 6
.. ..$ : int [1:4] 7 8 9 10
.. ..$ : int [1:7] 11 12 13 14 15 16 17
.. ..$ : int 18
.. ..$ : int [1:3] 19 20 21
.. ..$ : int [1:4] 22 23 24 25
.. ..$ : int [1:2] 26 27
.. ..$ : int [1:5] 28 29 30 31 32
.. ..$ : int [1:4] 33 34 35 36
.. ..$ : int [1:4] 37 38 39 40
.. ..$ : int [1:5] 41 42 43 44 45
.. ..$ : int [1:3] 46 47 48
.. ..$ : int [1:2] 49 50
.. ..$ : int [1:3] 51 52 53
.. ..$ : int [1:3] 54 55 56
.. ..$ : int [1:3] 57 58 59
.. ..$ : int [1:4] 60 61 62 63
.. ..$ : int 64
.. ..$ : int 65
.. ..$ : int [1:2] 66 67
.. ..$ : int 68
.. ..$ : int 69
.. ..$ : int 70
.. ..$ : int [1:3] 71 72 73
.. ..$ : int [1:3] 74 75 76
.. ..$ : int 77
.. ..$ : int [1:2] 78 79
.. ..$ : int 80
.. ..$ : int 81
.. ..$ : int 82
.. ..$ : int [1:3] 83 84 85
.. ..$ : int 86
.. ..$ : int 87
.. ..$ : int 88
.. ..$ : int [1:3] 89 90 91
.. ..$ : int [1:2] 92 93
.. ..$ : int [1:3] 94 95 96
.. ..$ : int [1:3] 97 98 99
.. ..$ : int [1:2] 100 101
.. ..$ : int 102
.. ..$ : int 103
.. ..$ : int [1:3] 104 105 106
.. ..$ : int 107
.. ..$ : int [1:2] 108 109
.. ..$ : int 110
.. ..$ : int [1:3] 111 112 113
.. ..$ : int [1:2] 114 115
.. ..$ : int [1:5] 116 117 118 119 120
.. ..$ : int [1:3] 121 122 123
.. ..$ : int [1:5] 124 125 126 127 128
.. ..$ : int 129
.. ..$ : int [1:4] 130 131 132 133
.. ..$ : int 134
.. ..$ : int [1:4] 135 136 137 138
.. ..$ : int [1:4] 139 140 141 142
.. ..$ : int [1:2] 143 144
.. ..$ : int [1:2] 145 146
.. ..$ : int 147
.. ..$ : int [1:2] 148 149
.. ..$ : int [1:3] 150 151 152
.. ..$ : int [1:4] 153 154 155 156
.. ..$ : int [1:3] 157 158 159
.. ..$ : int 160
.. ..$ : int [1:2] 161 162
.. ..$ : int [1:2] 163 164
.. ..$ : int 165
.. ..$ : int 166
.. ..$ : int 167
.. ..$ : int [1:3] 168 169 170
.. ..$ : int 171
.. ..$ : int 172
.. ..$ : int [1:2] 173 174
.. ..$ : int [1:2] 175 176
.. ..$ : int 177
.. ..$ : int 178
.. ..$ : int [1:3] 179 180 181
.. ..$ : int 182
.. ..$ : int [1:2] 183 184
.. ..$ : int 185
.. ..$ : int 186
.. ..$ : int [1:3] 187 188 189
.. ..$ : int 190
.. ..$ : int 191
.. ..$ : int [1:2] 192 193
.. ..$ : int [1:2] 194 195
.. ..$ : int [1:3] 196 197 198
.. ..$ : int [1:4] 199 200 201 202
.. ..$ : int [1:4] 203 204 205 206
.. ..$ : int [1:2] 207 208
.. ..$ : int [1:3] 209 210 211
.. ..$ : int [1:2] 212 213
.. ..$ : int [1:3] 214 215 216
.. ..$ : int 217
.. ..$ : int 218
.. ..$ : int 219
.. ..$ : int [1:15] 220 221 222 223 224 225 226 227 228 229 ...
.. ..$ : int [1:2] 235 236
.. .. [list output truncated]
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
This yields sub_gw where the observations for each plot are distributed across multiple rows. Using pivot_wider() with the names from Scientific_Name and with values from mean_cir this becomes:
sub_wide<-sub_gw %>%
pivot_wider(names_from = Scientific_Name, values_from = mean_cir)There are a lot of NAs! We could fill them in:
sub_wide<-sub_gw %>%
pivot_wider(names_from = Scientific_Name, values_from = mean_cir, values_fill = 0)Can you think of any reasons you may to have the data in this wide format?
Pivoting from long to wide format
The opposing situation could occur if we had been provided with data in the form of sub_wide, where the scientific names are column names, but we wish to treat them as values of a genus variable instead.
In this situation we are reshaping the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.
pivot_longer() takes four principal arguments:
the data
the names_to column variable we wish to create from column names.
the values_to column variable we wish to create and fill with values.
cols are the name of the columns we use to make this pivot (or to drop).
In pivoting longer, we also need to specify what columns to reshape.
sub_long <- sub_wide %>% pivot_longer(names_to = "Scientific_Name", values_to = "mean_cir", cols = -Site_ID)
Joining data
In some cases, you may have two datasets with different information that you want to join and analyse together. There are several functions in the tidyverse to achieve this:
Let’s read in another dataset so we can explore different joins.
CHALLENGE
Read in the data found in the “bird” folder. Name it veg_quad and explore the dataset with some of the functions we learned earlier. Then
bird_survey<-read_csv(here("Data", "bird", "M05_050_Bird_Survey_Analysis_Join_Parent_Child.csv"))Rows: 19059 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (40): Prj_Cde, Site_ID, Unlisted_site, Site_TargetName, Staff_Name, Add...
dbl (16): OID_, Site_Longitude_GDA2020, Site_Latitude_GDA2020, Year, Cloud_...
lgl (1): Legacy_Prop_DB_Distance
dttm (2): Date, Embedded_survey_end_time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_bird<-bird_survey %>%
dplyr::select(Site_ID, Date, Year, Birdlife_bird_count, Scientific_Name) %>%
filter(!is.na(Site_ID))bird_avg<-sub_bird %>%
group_by(Site_ID, Year) %>%
summarise(mean_bird_count = mean(Birdlife_bird_count, na.rm = T),
count_bird_species = length(unique(Scientific_Name)))`summarise()` has grouped output by 'Site_ID'. You can override using the
`.groups` argument.
tree_avg<-sub_tree_health_yrs %>%
group_by(Site_ID, Year) %>%
summarise(mean_cir = mean(Circumference, na.rm = T),
count_tree_species = length(unique(Scientific_Name)))`summarise()` has grouped output by 'Site_ID'. You can override using the
`.groups` argument.
We can now try a full_join which will keep all rows from both datasets. We can just list the two datasets we want to join, and R will search for the common column names.
tree_bird<-full_join(tree_avg, bird_avg)Joining with `by = join_by(Site_ID, Year)`
However, we could also specify the column names that we want to join.
tree_bird<-full_join(tree_avg, bird_avg, by = c("Site_ID", "Year"))It looks like there is a lot more bird data than tree data. If we know this is the case, we could try a left_join . This will keep all the rows of the tree data and matching rows from the bird data.
tree_bird<-left_join(tree_avg, bird_avg, by = c("Site_ID", "Year"))We still have some missing values. Let’s assume these are true missing data values and not 0s and remove them from our data
tree_bird<-tree_bird %>%
filter(!is.na(mean_bird_count))Great! We have now created a brand new dataset. Let’s look at how we could save this dataset.
Exporting data
Similar to the read_csv function, we can use the write_csv function to save our data outputs. We can also use the here function again to help with file paths.
write_csv(tree_bird, here("output_data", "Tree_bird_join.csv"))Let’s check and see if it is there.
Success!
Simple models
Before we move onto data visualisation, let’s run a quick test to see if tree species richness is correlated to bird species richness.
For this we will construct a linear regression using the lm function in the lme4 package. First let’s install and attach the package.
#install.packages("lme4")library(lme4)Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
We can look at the help page of the lm function to get more information on what arguments it requires.
Let’s subset our data to the most recent year
sub_tree_bird<-tree_bird %>%
filter(Year == 2023)Now let’s test whether tree species richness has an effect on bird species richness across sites and years:
tree_bird_reg<-lm(count_tree_species ~ count_bird_species, data = sub_tree_bird)summary(tree_bird_reg)
Call:
lm(formula = count_tree_species ~ count_bird_species, data = sub_tree_bird)
Residuals:
Min 1Q Median 3Q Max
-2.2601 -0.9437 -0.5184 0.7226 5.0007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.10930 0.25660 4.323 5.31e-05 ***
count_bird_species 0.07416 0.00914 8.114 1.65e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.489 on 66 degrees of freedom
Multiple R-squared: 0.4994, Adjusted R-squared: 0.4918
F-statistic: 65.84 on 1 and 66 DF, p-value: 1.654e-11
Wow! Higher tree species richness leads to higher bird species richness!
We can try plotting them using the base R function plot to see the relationship
plot(sub_tree_bird$count_tree_species, sub_tree_bird$count_bird_species)Oh no! There looks to be outliers that may be impacting our results. Let’s remove it and try again.
We can look for outliers using the boxplot function in base R.
boxplot(sub_tree_bird$count_tree_species)boxplot(sub_tree_bird$count_bird_species)sub_tree_bird2<-sub_tree_bird %>%
filter(count_tree_species<7,
count_bird_species<100)boxplot(sub_tree_bird2$count_tree_species)boxplot(sub_tree_bird2$count_bird_species)And running our model again gives us:
tree_bird_reg2<-lm(count_tree_species ~ count_bird_species, data = sub_tree_bird2)summary(tree_bird_reg2)
Call:
lm(formula = count_tree_species ~ count_bird_species, data = sub_tree_bird2)
Residuals:
Min 1Q Median 3Q Max
-1.6978 -1.0023 -0.4196 0.7277 3.8258
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.74869 0.36224 4.827 8.95e-06 ***
count_bird_species 0.03273 0.01817 1.801 0.0764 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.306 on 64 degrees of freedom
Multiple R-squared: 0.04824, Adjusted R-squared: 0.03336
F-statistic: 3.244 on 1 and 64 DF, p-value: 0.07642
Without the outliers, we no longer see a relationship between tree species richess and bird species richness.
plot(sub_tree_bird2$count_tree_species, sub_tree_bird2$count_bird_species)Perhaps we want to know whether tree species richness or bird species richness has changed through time
tree_year_reg<-lm(Year ~ count_tree_species, data = tree_bird)summary(tree_year_reg)
Call:
lm(formula = Year ~ count_tree_species, data = tree_bird)
Residuals:
Min 1Q Median 3Q Max
-1.6641 -0.6449 0.3359 0.3487 0.4255
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.023e+03 8.434e-02 23982.123 <2e-16 ***
count_tree_species -6.402e-03 2.498e-02 -0.256 0.798
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5208 on 99 degrees of freedom
Multiple R-squared: 0.0006629, Adjusted R-squared: -0.009431
F-statistic: 0.06567 on 1 and 99 DF, p-value: 0.7983
bird_year_reg<-lm(Year ~ count_bird_species, data = tree_bird)summary(bird_year_reg)
Call:
lm(formula = Year ~ count_bird_species, data = tree_bird)
Residuals:
Min 1Q Median 3Q Max
-1.6497 -0.6482 0.3434 0.3510 0.3610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.023e+03 7.960e-02 25411.593 <2e-16 ***
count_bird_species 7.665e-04 3.040e-03 0.252 0.801
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5208 on 99 degrees of freedom
Multiple R-squared: 0.0006417, Adjusted R-squared: -0.009453
F-statistic: 0.06357 on 1 and 99 DF, p-value: 0.8015
We could look at outliers again, but it seems to be a low chance that there has been a change in species richness for either birds or trees in the short three year period that we tested.
If these sorts of questions and modelling tools are of interest to you - make sure to check out the Advanced course where we will dive into more detail and nuances!
A brief foray into for loops
For some analyses in R you may find yourself repeating the same code over and over again. This might be a good sign for using a for loop. A for loop is a way to automate your scripts and reduce repetition. Let’s look at a quick example. Christina will dive more into this tomorrow in the Advanced course.
Let’s have a look at our simple model above. We separated out the year 2023, but maybe we want to plot or run models each year. A for loop would be a good way to do this.
To run a for loop, one of the main components you need is the variable to loop through. In our case, we will loop through the Year column of the tree_bird data frame. First, let’s separate out our years into a vector so we can loop through it.
yrs<-unique(tree_bird$Year)
length(yrs)[1] 3
You can see we have 3 years in the tree_bird dataset, so the loop will run three times. Let’s make box plots of each year to explore our data using a for loop. We can set it up like this:
for(i in 1:length(yrs)){
print(i)
sub<-tree_bird %>%
filter(Year == yrs[[i]])
png(here(paste0("figures/tree_boxplot_", yrs[[i]],".png")), width = 350, height = 350)
boxplot(sub$count_tree_species)
dev.off()
png(here(paste0("figures/bird_boxplot_", yrs[[i]],".png")), width = 350, height = 350)
boxplot(sub$count_bird_species)
dev.off()
}[1] 1
[1] 2
[1] 3
In a for loop, you can run models, summarise and manipulate data, or do anything really! It’s really helpful when you want to perform the same analyses across years, sites, countries, etc.
Writing functions is another way to automate your code - but we will save that for another time.
Data visualisation with ggplot2
We’ve done all of the hard data wrangling and manipulation - now let’s make some pretty figures using the ggplot2 package in R.
You know what we need to do first - install and load the package!
#install.packages("ggplot2")library(ggplot2)Remember you only need to install a package once. After installation, future scripts can just use the library call to load the package into the R session.
ggplot2
ggplot2 is a plotting package that provides helpful commands to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.
ggplot2 refers to the name of the package itself. When using the package we use the function ggplot() to generate the plots, and so references to using the function will be referred to as ggplot() and the package as a whole as ggplot2
ggplot2 plots work best with data in the ‘tidy’, ‘long’ format, i.e., a column for every variable, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2
ggplot graphics are built layer by layer by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
- use the
ggplot()function and bind the plot to a specific data frame using thedataargument
ggplot(data = tree_bird)define an aesthetic mapping (using the aesthetic (
aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g., as x/y positions or characteristics such as size, shape, color, etc.ggplot(data = tree_bird, mapping = aes(x = count_tree_species, y = count_bird_species))
add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars).
ggplot2offers many different geoms; we will use some common ones today, including:geom_point()for scatter plots, dot plots, etc.geom_boxplot()for, well, boxplots!geom_line()for trend lines, time series, etc.
To add a geom to the plot use + operator. Because we have two continuous variables, let’s use geom_point() first:
ggplot(data = tree_bird, mapping = aes(x = count_tree_species, y = count_bird_species)) +
geom_point()We can also add colors for points
ggplot(data = tree_bird, mapping = aes(x = count_tree_species, y = count_bird_species)) +
geom_point(alpha = 0.1, color = "blue")Or to color each species in the plot differently, you could use a vector as an input to the argument color. ggplot2 will provide a different color corresponding to different values in the vector. Here is an example where we color with Site_ID:
ggplot(data = tree_bird, mapping = aes(x = count_tree_species, y = count_bird_species)) +
geom_point(aes(color = as.factor(Year)))Challenge
Based on the tree health data where we classified trees as “small” or “large” based on circumference, plot tree circumference vs tree diameter and color the points by Size_group
ggplot(data = sub_tree_health_cat, aes(x = Site_ID, y = Circumference)) +
geom_point(aes(color = Size_group))Warning: Removed 22 rows containing missing values (`geom_point()`).
Boxplot
Let’s subset our tree health data to only include the top 10 most surveyed species
tree_survey_count<-sub_tree_health_cat %>%
filter(!is.na(Scientific_Name)) %>%
count(Scientific_Name) %>%
arrange(desc(n))
toi<-tree_survey_count %>%
filter(n>=73)
top_tree<-sub_tree_health_cat %>%
filter(Scientific_Name %in% toi$Scientific_Name)We can use boxplots to visualize the distribution of weight within each species:
ggplot(top_tree, aes(x = Scientific_Name, y = Circumference)) +
geom_boxplot()It’s a bit hard to read the x-axis labels, let’s change that
ggplot(top_tree, aes(x = Scientific_Name, y = Circumference)) +
geom_boxplot() +
coord_flip()ggplot(data = top_tree, mapping = aes(x = Scientific_Name, y = Circumference)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(alpha = 0.3, color = "tomato") +
coord_flip()CHALLENGE
Replace the boxplot with a violon plot (hing: see geom_violin
ggplot(data = top_tree, mapping = aes(x = Scientific_Name, y = Circumference)) +
geom_violin(outlier.shape = NA) +
geom_jitter(alpha = 0.3, color = "tomato") +
coord_flip()Warning in geom_violin(outlier.shape = NA): Ignoring unknown parameters:
`outlier.shape`
Plotting time series
soil_quad<-read_csv(here("Data", "veg-tree-soil", "M04_020_Soil_Surface_Quadrat_Analysis_Join_Parent_Child.csv"))Rows: 10141 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (22): Parent_ID, Prj_Cde, Project_Name, Site_ID, Site_TargetName, Staff...
dbl (25): OID_, Site_Longitude_GDA2020, Site_Latitude_GDA2020, Year, Bare_g...
lgl (1): Emergents
dttm (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
soil_quad<-soil_quad %>%
dplyr::select(Project_Name, Site_ID, Year, Quadrat_Function, Bare_ground_percent, Cryptogams_percent, Litter_percent, Fallen_branches_percent, Large_logs_percent, Plant_base_percent, Surface_rock_percent, Gravel_Gibber_Stones_percent) %>%
filter(!is.na(Site_ID))yearly_avg <- soil_quad %>%
group_by(Year) %>%
summarise_at(vars(Bare_ground_percent:Gravel_Gibber_Stones_percent), mean, na.rm = T)If you notice, this data is not in the “tidy” format. Let’s pivot it so that it will be easier to work with in ggplot2.
yearly_avg_long<-pivot_longer(data = yearly_avg, names_to = "Variable", values_to = "Avg_percent", cols = Bare_ground_percent:Gravel_Gibber_Stones_percent)Timelapse data can be visualized as a line plot with years on the x-axis and counts on the y-axis:
ggplot(data = yearly_avg_long, aes(x = Year, y = Avg_percent)) +
geom_line()Unfortunately, this does not work because we plotted data for all the genera together. We need to tell ggplot to draw a line for each genus by modifying the aesthetic function to include group = Variable:
ggplot(data = yearly_avg_long, aes(x = Year, y = Avg_percent, group = Variable)) +
geom_line()We will be able to distinguish genera in the plot if we add colors (using color also automatically groups the data):
ggplot(data = yearly_avg_long, aes(x = Year, y = Avg_percent, group = Variable, color = Variable)) +
geom_line()Faceting
ggplot has a special technique called faceting that allows the user to split one plot into multiple plots based on a factor included in the dataset. We will use it to make a time series plot for each genus:
ggplot(data = yearly_avg_long, aes(x = Year, y = Avg_percent)) +
geom_line() +
facet_wrap(facets = vars(Variable))You can facet by more than one variable. Let’s go back and add in the quadrat function column to facet our data by quadrat function and our cover type.
yearly_avg_fun <- soil_quad %>%
group_by(Year, Quadrat_Function) %>%
summarise_at(vars(Bare_ground_percent:Gravel_Gibber_Stones_percent), mean, na.rm = T)yearly_avg_fun_long<-pivot_longer(data = yearly_avg_fun, names_to = "Variable", values_to = "Avg_percent", cols = Bare_ground_percent:Gravel_Gibber_Stones_percent)ggplot(data = yearly_avg_fun_long, aes(x = Year, y = Avg_percent)) +
geom_line() +
facet_grid(rows = vars(Quadrat_Function), cols = vars(Variable))You can also organise the panels only by rows (or only by columns):
# One column, facet by row
ggplot(data = yearly_avg_fun_long,
mapping = aes(x = Year, y = Avg_percent, color = Quadrat_Function)) +
geom_line() +
facet_grid(rows = vars(Variable))Warning: Removed 2 rows containing missing values (`geom_line()`).
# One row, facet by column
ggplot(data = yearly_avg_fun_long,
mapping = aes(x = Year, y = Avg_percent, color = Quadrat_Function)) +
geom_line() +
facet_grid(cols = vars(Variable))Warning: Removed 2 rows containing missing values (`geom_line()`).
ggplot2 themes
Usually plots with white background look more readable when printed. Every single component of a ggplot graph can be customized using the generic theme() function, as we will see below. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.
For example, we can change our previous graph to have a simpler white background using the theme_bw() function:
ggplot(data = yearly_avg_fun_long, aes(x = Year, y = Avg_percent, col = Quadrat_Function)) +
geom_line() +
facet_wrap(facets = vars(Variable)) +
theme_bw()Warning: Removed 2 rows containing missing values (`geom_line()`).
In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme.
Customization
Take a look at the ggplot2 cheat sheet, and think of ways you could improve the plot.
Now, let’s change names of axes to something more informative than ‘year’ and ‘n’ and add a title to the figure:
ggplot(data = yearly_avg_fun_long, aes(x = Year, y = Avg_percent, col = Quadrat_Function)) +
geom_line() +
facet_wrap(facets = vars(Variable)) +
labs(x = "Year",
y = "Percent cover",
color = "Function") +
theme_bw()Warning: Removed 2 rows containing missing values (`geom_line()`).
ggplot(data = yearly_avg_fun_long, aes(x = Year, y = Avg_percent, col = Quadrat_Function)) +
geom_line() +
facet_wrap(facets = vars(Variable)) +
labs(x = "Year",
y = "Percent cover",
color = "Function") +
theme_bw() +
theme(text = element_text(size = 16))Warning: Removed 2 rows containing missing values (`geom_line()`).
Note that it is also possible to change the fonts of your plots. If you are on Windows, you may have to install the extrafont package, and follow the instructions included in the README for this package.
After our manipulations, you may notice that the values on the x-axis are still not properly readable. Let’s change the orientation of the labels and adjust them vertically and horizontally so they don’t overlap. You can use a 90 degree angle, or experiment to find the appropriate angle for diagonally oriented labels. We can also modify the facet label text (strip.text) to italicize the genus names:
ggplot(data = yearly_avg_fun_long, aes(x = Year, y = Avg_percent, col = Quadrat_Function)) +
geom_line() +
facet_wrap(facets = vars(Variable)) +
labs(x = "Year",
y = "Percent cover",
color = "Function") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
strip.text = element_text(face = "italic"),
text = element_text(size = 16))Warning: Removed 2 rows containing missing values (`geom_line()`).
Arranging plots
Faceting is a great tool for splitting one plot into multiple plots, but sometimes you may want to produce a single figure that contains multiple plots using different variables or even different data frames. The patchwork package allows us to combine separate ggplots into a single figure while keeping everything aligned properly. Like most R packages, we can install patchwork from CRAN, the R package repository:
#install.packages("patchwork")library(patchwork)After you have loaded the patchwork package you can use + to place plots next to each other, / to arrange them vertically, and plot_layout() to determine how much space each plot uses. Note that you need to save the plots as new variables. They will not show up below now unless you type in the variable name.
perc_cover_plot<-ggplot(data = yearly_avg_long, aes(x = Year, y = Avg_percent)) +
geom_line() +
facet_wrap(facets = vars(Variable)) +
labs(x = "Year",
y = "Average percent cover")top_tree_plot<-ggplot(data = top_tree, mapping = aes(x = Scientific_Name, y = Circumference)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(alpha = 0.3, color = "tomato") +
coord_flip()perc_cover_plot+top_tree_plotYou can also use parentheses () to create more complex layouts. There are many useful examples on the patchwork website
Exporting plots
After creating your plot, you can save it to a file in your favorite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters. The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.
Instead, use the ggsave() function, which allows you to easily change the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi):
ggsave(here("figures", "Top_tree_boxplot.png"), top_tree_plot, width = 15, height = 10)If you use the ggsave function without specifying a plot name, it will save the last printed plot like this:
perc_cover_plot+top_tree_plotggsave(here("figures", "Joined_plot.png"),width = 20, height = 10)Voila!
The finish line
Phew! We have learned a lot today! Well done and we hope this has helped give you more confidence to tackle your next project in R. Remember - one of the greatest aspects of the R programming language is the community. There are loads of resources and ways to find help and grow your skills. As you continue to code you will only get better at trouble shooting problems, writing efficient code, and sufficiently documenting your scripts.
As a note - you should save your script periodically as you go to make sure you don’t lose any of your valuable work.
One more useful function before we go! If you ever want to clear your R environment you can use the following (but be careful as it will wipe all the variables you have created!).
rm(list=ls())It’s good practice to start with an empty environment to avoid mistakes.